Probability Theory
Table of Contents
- 1. Probability
- 2. Probability Space
- 3. Random Variable
- 4. Probability Distribution
- 4.1. Properties
- 4.2. Probability Mass Function
- 4.3. Probability Density Function
- 4.4. Cumulative Distribution Function
- 4.5. Normalization and Denormalization
- 4.6. Combination
- 4.7. Instances
- 5. Parametric Family
- 6. Stochastic Process
- 7. References
- Probability Calculus
1. Probability
1.1. Marginal Probability
- Probability distribution of a subset of a larger collection of random variables.
1.2. Conditional Probability
- Probability contingent upon the values of the other variables.
- \[ p_{Y|X}(y\mid x) := \mathrm{P}[Y=y\mid X=x] = \frac{\mathrm{P}(\{X=x\}\cap \{Y=y\})}{\mathrm{P}(\{X=x\})} \]
- \[ f_{Y|X}(y\mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} \]
1.3. Law of Total Probability
Relation between marginal probability and conditional probability. \[ \mathrm{P}(A) = \sum_k \mathrm{P}(A\cap B_k) = \sum_k \mathrm{P}(A\mid B_k)\mathrm{P}(B_k) \] Generally, \[ \mathrm{P}(A) = \int_\Omega\mathrm{P}(A\mid X)\,\mathrm{dP}. \]
Further, \[ \mathrm{P}(A\mid B) = \sum_n \mathrm{P}(A \mid C_n)\,\mathrm{P}(C_n\mid B). \]
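- A minimal numeric sketch of the law of total probability over a finite partition; the probabilities below are hypothetical values chosen for illustration:
```python
# Law of total probability over a finite partition {B_k} (hypothetical numbers).
p_B = [0.5, 0.3, 0.2]          # P(B_k); the B_k partition the sample space
p_A_given_B = [0.9, 0.5, 0.1]  # P(A | B_k)

# P(A) = sum_k P(A | B_k) * P(B_k)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(p_A)  # 0.62
```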
1.4. Bayesian Probability
- Interpretation of probability as reasonable expectation, instead of frequency or propensity.
- It represents a state of knowledge or a quantification of personal belief.
2. Probability Space
2.1. Definition
- A measure space such that the measure of the whole space is one.
- It is a triple \((\Omega, \Sigma, \mathrm{P})\) consisting of
- The sample space \(\Omega\): an arbitrary non-empty set,
- The event space \(\Sigma\): σ-algebra on \(\Omega\),
- \(\Sigma\) stands for σ-algebra. \(\mathcal{F}\) (as in filtration) or \(\mathcal{A}\) are also often used by convention.
- The probability measure \(\mathrm{P}: \Sigma \to [0, 1]\).
2.2. Probability Measure
2.2.1. Definition
- Probability measure \(\mathrm{P}\) is a measure over \((\Omega, \Sigma)\), with:
- \(\mathrm{P}: \Sigma \to [0, 1]\), with \(\mathrm{P}(\varnothing) = 0\) and \(\mathrm{P}(\Omega) = 1\).
- Countable Additivity: \[ \mathrm{P}\left(\bigcup_{i\in \mathbb{N}}E_i\right) = \sum_{i\in \mathbb{N}}\mathrm{P}({E_i}) \] where \(\{E_i\}\) are pairwise disjoint sets.
- The validity of this definition of a probability measure is precisely given by the Kolmogorov axioms:
- Non-negativity
- Unit measure
- σ-additivity
2.2.2. Notations
- Probability that a random variable \(X\) takes a value in a measurable set \(S\subseteq E\) is written as \[ \mathrm{P}[X\in S] := \mathrm{P}(\{\omega \in \Omega\mid X(\omega)\in S\}). \]
3. Random Variable
- Kolmogorov Definition
- A random variable is a measurable function from the sample space to a measurable space \(E\), often taken to be \(\mathbb{R}\): \[ X: \Omega \to E. \]
4. Probability Distribution
- A probability distribution forgets the underlying probability space, and only remembers the output values of a random variable.
4.1. Properties
4.1.1. Mean
4.1.2. Variance
4.1.3. Skewness
- Korean: 왜도.
4.1.4. Kurtosis
- Korean: 첨도.
4.1.5. Absolutely Continuous
- A distribution that admits a probability density function, rather than a mass function on countably many points.
- Random variable \(X\) is absolutely continuous if there exists a function \(f_X\) such that for each interval \([a,b] \subseteq \mathbb{R}\): \[ \mathrm{P}[a\le X \le b] = \int_a^b f_X(x)\,dx \]
4.2. Probability Mass Function
- The probability mass function \(p_X(x)\) is defined as \[ p_X(x) := \mathrm{P}[X=x]. \]
4.3. Probability Density Function
4.3.1. Definition
- The probability density function of a random variable \(X\) that takes values in a measure space \((E, \Sigma, \mu)\) is the Radon-Nikodym derivative of the pushforward measure \(X_*\mathrm{P}\) with respect to the reference measure \(\mu\): \[ f_X(x) := \frac{d X_*\mathrm{P}}{d\mu} \]
- That is, \[ X_*\mathrm{P}(A) = \int_A f_X\,d\mu \] for any measurable set \(A\in \Sigma\).
4.3.2. Properties
- For a real-valued random variable with an absolutely continuous univariate distribution, the density is the derivative of the cumulative distribution function:
- \[ f_X(x) = \frac{dF_X}{dx}, \]
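- A quick numerical check of \(f_X = dF_X/dx\), using the standard normal purely as a concrete example:
```python
# Sketch: f_X = dF_X/dx, checked for the standard normal by a central difference.
import math

def F(x):  # standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):  # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

x, h = 0.7, 1e-5
dF = (F(x + h) - F(x - h)) / (2.0 * h)  # numerical derivative of the CDF
print(dF, f(x))  # the two values agree to roughly 1e-10
```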
4.4. Cumulative Distribution Function
- The cumulative distribution function of a real-valued random variable is \[ F_X(x) := \mathrm{P}[X\le x]. \]
4.5. Normalization and Denormalization
- The random variable \(X\) transforms contravariantly, and the underlying probability density transforms covariantly:
- \[ Z = \frac{X - \mu}{\sigma} \]
- \[
f_Z(x) = \sigma f_X(\sigma x + \mu)
\]
- The \(\sigma\) factor is the normalization constant, to compensate for \(f_X\) being scaled down in the \(x\) direction by the factor of \(\sigma\).
- Denormalization is the inverse transform:
- \[ X = \sigma Z + \mu \]
- \[ f_X(x) = \frac{1}{\sigma}f_Z\left(\frac{x - \mu}{\sigma}\right) \]
- The random variable and the probability density transform oppositely; see the sketch below.
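- A sketch of the two transforms above, assuming \(X \sim \mathcal{N}(\mu, \sigma^2)\) only for concreteness; the histogram of the normalized samples matches \(\sigma f_X(\sigma x + \mu)\):
```python
# Normalization Z = (X - mu)/sigma versus the covariant density transform
# f_Z(x) = sigma * f_X(sigma * x + mu). X ~ N(mu, sigma^2) is an assumption.
import numpy as np

mu, sigma = 3.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=200_000)
z = (x - mu) / sigma                       # normalized (contravariant) variable

def f_X(t):                                # density of X
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

hist, edges = np.histogram(z, bins=100, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pred = sigma * f_X(sigma * centers + mu)   # covariant density transform
print(np.max(np.abs(hist - pred)))         # small, up to Monte Carlo noise
```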
4.6. Combination
4.6.1. Distribution of Sum
- For independent \(X\) and \(Y\): \[ f_{X+Y}(x) = (f_X * f_Y)(x) \]
- where \(*\) is the convolution.
- List of convolutions of probability distributions - Wikipedia
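- A minimal numerical sketch of the convolution formula, using two independent \(\mathrm{Uniform}(0,1)\) variables, whose sum has the triangular density on \([0,2]\):
```python
# Sketch: density of a sum of independent variables as a convolution.
# Two independent Uniform(0,1) variables give the triangular density on [0, 2].
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                   # Uniform(0,1) density on its support
conv = np.convolve(f, f) * dx         # numerical convolution (f_X * f_Y)
grid = np.arange(len(conv)) * dx      # support of X + Y, approximately [0, 2)

print(np.interp(0.5, grid, conv))     # ≈ 0.5 on the rising edge
print(np.interp(1.0, grid, conv))     # ≈ 1.0 at the peak of the triangle
```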
4.6.2. Product Distribution
- \[ f_{XY}(z) = \int_{-\infty}^\infty f_{X,Y}\left(x, \frac{z}{x}\right)\frac{1}{|x|}\,dx \]
- Distribution of the product of two random variables - Wikipedia
4.6.3. Ratio Distribution
- \[ f_{X/Y}(z) = \int_{-\infty}^\infty f_{X,Y}(zy, y)\,|y|\,dy \]
- Ratio distribution - Wikipedia
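- A Monte Carlo sketch of a classic instance of the ratio-distribution formula: the ratio of two independent standard normals is standard Cauchy:
```python
# Monte Carlo sketch: the ratio of two independent standard normals is
# standard Cauchy; quantiles are compared, which is robust to the heavy tails.
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(500_000) / rng.standard_normal(500_000)

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(np.quantile(r, p))           # sample quantiles of the ratio
print(np.tan(np.pi * (p - 0.5)))   # standard Cauchy quantile function
```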
4.7. Instances
- The unexpected probability result confusing everyone - YouTube
- For independent random variables \(X\) and \(Y\), uniformly distributed on \([0,1]\), the probability distribution of \(\max(X,Y)\) is equal to the probability distribution of \(\sqrt{X}\).
- Similarly, the probability distribution of \(\max(X_1, X_2,\dots, X_n)\) is equal to the probability distribution of \(\sqrt[n]{X_1}\).
- Shockingly, the probability distribution of \((XY)^Z\) is again uniform, for independent uniformly distributed random variables \(X,Y,Z\); see the Monte Carlo sketch below.
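- A Monte Carlo sketch of the instances above, with \(X, Y, Z\) independent and uniform on \([0,1]\):
```python
# Monte Carlo sketch of the instances above: X, Y, Z independent Uniform(0,1).
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
X, Y, Z = rng.random(n), rng.random(n), rng.random(n)

q = [0.25, 0.5, 0.75]
# max(X, Y) and sqrt(X) share the CDF F(t) = t^2 on [0, 1]
print(np.quantile(np.maximum(X, Y), q))  # ≈ [0.5, 0.707, 0.866]
print(np.quantile(np.sqrt(X), q))        # same values

# (X Y)^Z is again Uniform(0,1): uniform quantiles are the identity
print(np.quantile((X * Y) ** Z, q))      # ≈ [0.25, 0.5, 0.75]
```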
4.7.1. Discrete Distributions
4.7.1.1. Bernoulli Distribution
4.7.1.2. Binomial Distribution
4.7.1.3. Multinomial Distribution
- Higher dimensional binomial distribution.
- \[
f(x_1,\dots,x_k;n,p_1,\dots,p_k) = \binom{n}{x_1,\dots,x_k}\prod_{i=1}^k p_i^{x_i}
\]
- using the multinomial coefficient.
- Multinomial distribution - Wikipedia
4.7.1.4. Poisson Distribution
- \[ f(k;\lambda) = \frac{\lambda^ke^{-\lambda}}{k!} \]
- Poisson distribution - Wikipedia
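- A minimal check that the pmf sums to one and has mean \(\lambda\):
```python
# Sketch: the Poisson pmf f(k; lambda) = lambda^k e^{-lambda} / k! sums to 1,
# and its mean equals lambda.
import math

lam = 3.0
pmf = [lam ** k * math.exp(-lam) / math.factorial(k) for k in range(50)]
print(sum(pmf))                               # ≈ 1.0
print(sum(k * p for k, p in enumerate(pmf)))  # ≈ lambda = 3.0
```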
4.7.1.5. Geometric Distribution
4.7.2. Continuous Distributions
4.7.2.1. Normal Distribution
- Gaussian Distribution
- \(\mathcal{N}(\mu, \sigma^2)\)
4.7.2.1.1. Probability Density Function
- \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
- Normalizing the area of the Gaussian function:
\[
e^{-x^2} \rightsquigarrow \frac{1}{\sqrt{\pi}}e^{-x^2}
\]
- This was Carl Friedrich Gauss's definition of the standard normal.
- It has a standard deviation of \(1/\sqrt{2}\).
- Denormalizing the probability distribution to mean \(\mu\) and standard deviation \(\sigma\): \[ \frac{1}{\sqrt{\pi}}e^{-x^2} \rightsquigarrow \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}. \]
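- A numerical check that the denormalized density integrates to one with the stated mean and standard deviation (the values of \(\mu\) and \(\sigma\) are arbitrary):
```python
# Numerical check of the denormalized Gaussian density: unit area, mean mu,
# standard deviation sigma, via a simple Riemann sum.
import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print((f * dx).sum())                           # ≈ 1.0 (unit area)
print((x * f * dx).sum())                       # ≈ mu
print(np.sqrt(((x - mu) ** 2 * f * dx).sum()))  # ≈ sigma
```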
4.7.2.2. Chi-Squared Distribution
4.7.2.2.1. Definition
- For independent, standard normal random variables \(Z_1, \dots, Z_k\), \[ Q = \sum_{i=1}^k Z_i^2 \] is distributed according to the chi-squared distribution with \(k\) degrees of freedom: \[ Q \sim \chi^2(k). \]
- This distribution arises in the least-squares method.
- Chi-squared distribution - Wikipedia
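- A Monte Carlo sketch of the definition: the sum of \(k\) squared standard normals has mean \(k\) and variance \(2k\), as expected of \(\chi^2(k)\):
```python
# Monte Carlo sketch: the sum of k squared standard normals has mean k and
# variance 2k, matching the chi-squared distribution with k degrees of freedom.
import numpy as np

rng = np.random.default_rng(3)
k = 5
q = (rng.standard_normal((200_000, k)) ** 2).sum(axis=1)
print(q.mean(), q.var())  # ≈ k = 5 and 2k = 10
```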
4.7.2.3. F-Distribution
4.7.2.4. Student's T-Distribution
- T-Distribution
- The name "Student" comes from the pen name of William Sealy Gosset.
- It has fat tails. The shape of the t-distribution approaches the standard normal distribution as the sample size increases.
- It is a parametric family \(t_{\rm DF}\) with respect to the degrees of freedom, which is directly related to the sample size.
4.7.2.5. Cauchy Distribution
- Lorentz Distribution, Cauchy-Lorentz Distribution, Lorentzian Function, Breit-Wigner Distribution
- \[ f(x; x_0, \gamma) = \frac{1}{\pi\gamma\left[1+\left(\frac{x-x_0}{\gamma}\right)^2\right]}. \]
- Its mean is undefined.
4.7.2.6. Exponential Distribution
- Negative Exponential Distribution
- In terms of rate \(\lambda\):
- \[ f(x;\lambda) = \lambda e^{-\lambda x} \]
- with \(f(x;\lambda) = 0\) if \(x<0\).
- In terms of scale parameter \(\beta = 1/\lambda\):
- \[ f(x;\beta) = \frac1\beta e^{-x/\beta} \]
- The distance between consecutive events in a Poisson point process.
- Continuous Geometric Distribution
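- A simulation sketch of the Poisson-process connection: gaps between consecutive events of a rate-\(\lambda\) Poisson point process are \(\mathrm{Exponential}(\lambda)\):
```python
# Sketch: gaps between consecutive events of a simulated Poisson point process
# of rate lambda on [0, T] are Exponential(lambda), with mean 1/lambda.
import numpy as np

rng = np.random.default_rng(4)
lam, T = 2.0, 50_000.0
n = rng.poisson(lam * T)             # total number of events on [0, T]
times = np.sort(rng.random(n) * T)   # given n, event times are iid uniform
gaps = np.diff(times)

print(gaps.mean(), 1 / lam)                      # ≈ 0.5 both
print(np.quantile(gaps, 0.5), np.log(2) / lam)   # median = ln(2)/lambda
```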
4.7.2.7. Beta Distribution
4.7.2.7.1. Definition
- \[
f(x; \alpha, \beta) := \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}
\]
- where \(\mathrm{B}(\alpha,\beta)\) is the beta function.
4.7.2.7.2. Properties
- It is the probability distribution of the estimator \(\hat{p}\) of
the probability of observing a positive event after observing
\(\alpha-1\) positive events and \(\beta-1\) negative events.
- \[ \hat{p} = \frac{\alpha-1}{\alpha + \beta-2} \sim \mathcal{Be}(\alpha, \beta) \]
- \[ \mathcal{Be}(\alpha, \beta) = \mathrm{P}[X_{\alpha+\beta-1} = \omega_+ \mid X_{\sigma(1)} = \cdots = X_{\sigma(\alpha-1)} = \omega_+, X_{\sigma(\alpha)} =\cdots = X_{\sigma(\alpha+\beta - 2)} = \omega_-] \]
- where \(X_i\)s are independent and identically distributed (iid) random variables, with unknown probability distribution.
- The harmonic mean is symmetric under exchanging \(\alpha \leftrightarrow \beta\) together with \(X \leftrightarrow 1-X\):
\[
H_X(\alpha, \beta) = H_{1-X}(\beta, \alpha) = \frac{\alpha - 1}{\alpha + \beta - 1} \quad (\alpha > 1)
\]
- where \(H_X\) is defined to be \[ H_X := \frac{1}{\mathrm{E}\left[\frac{1}{X}\right]} \]
- Concentration \(\kappa := \alpha+\beta\)
- Mode \[ \omega = \frac{\alpha-1}{\alpha+\beta -2} \]
- Variance
\[
\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha+\beta +1)}
\]
- It is asymptotically equal to the variance of the sample mean \(\bar{x}\) of Bernoulli-distributed random variables; see the Monte Carlo sketch below. \[ \frac{\hat{p}\hat{q}}{n} = \widehat{\sigma^2[\hat{p}]}, \quad\sigma^2[\bar{x}]= \sigma^2[\hat{p}] \]
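- A Monte Carlo sketch of the properties above: the variance formula and the closed form \(H_X = \frac{\alpha-1}{\alpha+\beta-1}\) of the harmonic mean (valid for \(\alpha > 1\)):
```python
# Monte Carlo sketch of the Beta(alpha, beta) properties above: the variance
# formula and the closed form of the harmonic mean (requires alpha > 1).
import numpy as np

rng = np.random.default_rng(5)
a, b = 3.0, 5.0
x = rng.beta(a, b, size=1_000_000)

print(x.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # ≈ 0.0260 both
print(1 / (1 / x).mean(), (a - 1) / (a + b - 1))      # H_X ≈ 2/7 both
```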
4.7.2.7.3. Beta Prime Distribution
- The probability distribution of the estimator of odds.
5. Parametric Family
- Explaining Parametric Families - YouTube
- A parametric family is a statistical model of an unknown probability distribution, on which one can do analysis.
6. Stochastic Process
- A sequence of random variables \( X = (X_i)_i \).
6.1. Quadratic Variation
- One kind of variation of a stochastic process.
6.1.1. Definition
\[ [X]_t = \lim_{\Vert P\Vert \to 0} \sum_{k=1}^n(X_{t_k} - X_{t_{k-1}})^2 \] where \(P = \{0 = t_0 < t_1 < \cdots < t_n = t\}\) is a partition of the interval \([0, t]\) and \(\Vert P\Vert\) is its mesh.
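- A simulation sketch: the realized quadratic variation of a Wiener path on \([0, t]\) approaches \(t\) as the partition is refined (anticipating the Wiener process below):
```python
# Sketch: the realized quadratic variation of a simulated Wiener path on [0, t]
# approaches t as the partition mesh shrinks.
import numpy as np

rng = np.random.default_rng(6)
t = 1.0
for n in (100, 10_000, 1_000_000):
    dW = rng.standard_normal(n) * np.sqrt(t / n)  # increments W_{t_k} - W_{t_{k-1}}
    print(n, (dW ** 2).sum())                     # → t = 1.0 as n grows
```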
6.1.2. Covariation
- Cross-Variation
6.1.2.1. Definition
\[ [X, Y]_t := \lim_{\Vert P\Vert \to 0}\sum_{k=1}^{n}(X_{t_k} - X_{t_{k-1}})(Y_{t_k} - Y_{t_{k-1}}) \]
6.1.2.2. Properties
- By the polarization identity: \[ [X, Y]_t = \frac{1}{2}([X + Y]_t - [X]_t - [Y]_t). \]
- For semimartingales: \[ d(X_tY_t) = X_{t-}\,dY_t + Y_{t-}\,dX_t + dX_t\,dY_t, \] where \(dX_t\,dY_t := d[X, Y]_t\).
6.2. Martingale
A stochastic process is called a martingale if the expected value of the immediate future is equal to the value of the current variable.
6.2.1. Definition
A discrete-time stochastic process \( X \) is a martingale if, for any time \(n\):
- \[ \operatorname{E}[|X_n|] < \infty, \]
- \[ \operatorname{E}[X_{n+1}\mid X_1,\dots,X_n ] = X_n. \]
Generally, a stochastic process \(Y: T\times \Omega \to S\), where \(S\) is a Banach space, is a martingale with respect to a filtration \(\Sigma_{*}\) and probability measure \(\mathrm{P}\), if:
- \( \Sigma_{*} \) is a filtration of the event space \(\Sigma\) of the underlying probability space \((\Omega,\Sigma, \mathrm{P})\).
- \(Y\) is adapted to the filtration \(\Sigma_*\), that is, for each \(t\) in the index set \(T\), the random variable \(Y_t\) is a \( \Sigma_t \)-measurable function.
- For each \(t\), \(Y_t\) lies in the \(L^p\) space \(L^1(\Omega, \Sigma_t, \mathrm{P}; S)\): \[ \operatorname{E}[\Vert Y_t \Vert_S] < \infty. \]
- For all \(s\) and \(t\) with \(s < t\): \[ \operatorname{E}[Y_t\mid \Sigma_s] = Y_s. \]
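- A Monte Carlo sketch of the discrete-time definition: a symmetric simple random walk is a martingale, so conditioning on \(X_n\) leaves the mean of \(X_{n+1}\) unchanged:
```python
# Sketch: a symmetric simple random walk is a discrete-time martingale.
# Conditioning on the walk's value at step 10 leaves the mean at step 11 there.
import numpy as np

rng = np.random.default_rng(7)
steps = rng.choice([-1, 1], size=(500_000, 20))
X = steps.cumsum(axis=1)   # X_n = sum of the first n steps

sel = X[:, 9] == 4         # condition on X_10 = 4
print(X[sel, 10].mean())   # ≈ 4, the martingale property, up to Monte Carlo error
```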
6.2.2. Local Martingale
A \( \Sigma_{*} \)-adapted stochastic process \( X \) is called a \( \Sigma_* \)-local martingale if there exists a sequence of \( \Sigma_* \)-stopping times \( \tau_k\colon \Omega \to [0, \infty) \) such that
- the \( \tau_k \) are almost surely increasing: \( \mathrm{P}[\tau_k < \tau_{k+1}] = 1 \),
- the \( \tau_k \) diverge almost surely: \( \mathrm{P}[ \lim_{k\to \infty} \tau_k = \infty ] = 1 \),
- the stopped process \( X_t^{\tau_k} := X_{\min\{t, \tau_k\}} \) is a \( \Sigma_{*} \)-martingale for every \( k \).
Each \( \tau_k \) indicates a single stopping criterion, for example, "if the value of the stochastic process drops below zero".
6.2.3. Semimartingale
A real-valued stochastic process \( X \) is called a semimartingale if it can be decomposed into a local martingale \( M \) and a càdlàg adapted process \( A \) of locally bounded variation: \[ X_t = M_t + A_t. \]
6.3. Markov Chain
- Markov Process
A discrete-time Markov chain is a sequence of random variables \( (X_1, X_2, X_3, \dots ) \) where each \( X_{i}\colon \Omega \to S \) takes a value, called a state, in the state space \( S \), with \( X_{i+1} \) dependent only on the previous variable \( X_i \).
In the continuous-time Markov chain, the index can also take a continuous value \( t \ge 0 \). In this case, the transition-rate matrix \( \mathbf{Q} \) is defined entrywise: \[ \mathbf{Q}_{ij} := \lim_{h\to 0^+} \frac{\mathrm{P}[X_{t+h} = j \mid X_t = i] - \delta_{ij}}{h}. \] The matrix is also known as a Q-matrix, intensity matrix, or infinitesimal generator matrix.
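A minimal sketch of a two-state discrete-time chain with a hypothetical transition matrix; iterating the matrix converges to the stationary distribution:
```python
# Sketch of a two-state discrete-time Markov chain (hypothetical transition
# matrix); iterating the matrix converges to the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1],    # P[X_{i+1} = j | X_i = i], rows sum to 1
              [0.4, 0.6]])

dist = np.array([1.0, 0.0])  # start surely in state 0
for _ in range(100):
    dist = dist @ P
print(dist)                  # ≈ [0.8, 0.2], the stationary distribution
```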
6.4. Wiener Process
6.4.1. Definition
6.4.1.1. Canonical Characterization
- \(W_0 = 0\)
- \(W_t\) is almost surely continuous
- \(W_t\) has independent increments: for non-overlapping intervals \(s_1 \le t_1 \le s_2 \le t_2\), the increments \(W_{t_1} - W_{s_1}\) and \(W_{t_2}-W_{s_2}\) are independent.
- \(W_t - W_s \sim \mathcal{N}(0, t - s)\) for \(0\le s \le t\), where \(\mathcal{N}(\mu, \sigma^2)\) is the normal distribution.
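- A sketch that builds Wiener paths directly from the canonical characterization, as cumulative sums of independent \(\mathcal{N}(0, dt)\) increments:
```python
# Sketch: Wiener paths built from the canonical characterization, as cumulative
# sums of independent N(0, dt) increments; W_T - W_0 ~ N(0, T) across paths.
import numpy as np

rng = np.random.default_rng(8)
n, T = 1_000, 1.0
dt = T / n

paths = np.cumsum(rng.standard_normal((50_000, n)) * np.sqrt(dt), axis=1)
print(paths[:, -1].mean(), paths[:, -1].var())  # ≈ 0 and ≈ T = 1.0
```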
6.4.1.2. Lévy Characterization
- \(W_0 = 0\)
- Almost surely continuous
- Martingale
- Quadratic variation: \( [W]_t = t \)
- This is the core property of the Itô calculus: \(dW_t^2 = dt\).
6.4.1.3. Spectral Characterization
- Sine series whose coefficients are independent \(\mathcal{N}(0,1)\) random variables. This is the result of the Kosambi-Karhunen-Loève Theorem.
6.4.2. Construction
- Scaling limit of a random walk. This is the result of Donsker's theorem.
6.4.3. Properties
- It describes the Brownian motion.
- It is the integral of white noise, which is a generalized Gaussian process.
6.5. Gaussian Process
6.5.1. Definition
A continuous stochastic process \( \{ X_t ; t\in T \}\) is Gaussian if and only if, for every finite set of indices \( t_1, \ldots, t_k \) in the index set \( T \), \[ \mathbf{X}_{t_1,\dots, t_k} = (X_{t_1}, \dots, X_{t_k}) \] is a multivariate Gaussian random variable.
6.5.2. Covariance Function
The second-order statistics completely define a Gaussian process.
The variances and covariances can be given by the covariance function \( K(x,x') \), and together with the restriction of the domain, they completely determine the probability density over functions with a continuous domain.
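A sampling sketch: on a finite grid, a Gaussian process draw is just a multivariate Gaussian draw with the kernel as covariance; the squared-exponential (RBF) kernel here is an assumed choice, not prescribed above:
```python
# Sketch: sampling a Gaussian process on a finite grid from its covariance
# function; the squared-exponential (RBF) kernel is an assumed choice.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 5.0, 200)
K = rbf(x, x) + 1e-9 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(9)
sample = rng.multivariate_normal(np.zeros(len(x)), K)  # one draw of the process
print(sample.shape, sample[:3])
```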
6.5.3. Kriging
- Gaussian Process Regression
The prior distribution is determined by a suitable choice of hyperparameters for the covariance function (or kernel), and the observations are used to update the prior.
It is a form of Bayesian inference, using the Gaussian process as a prior probability distribution.
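A minimal kriging sketch under the same assumed RBF kernel, with noiseless observations and hypothetical data points; the posterior mean and covariance follow the standard GP-regression formulas:
```python
# A minimal kriging (GP regression) sketch: noiseless observations, RBF kernel
# as an assumed prior covariance; inputs, values, and ell are hypothetical.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x_obs = np.array([0.0, 1.0, 2.5])    # observed inputs
y_obs = np.array([1.0, -0.5, 0.7])   # observed values
x_new = np.linspace(0.0, 3.0, 7)     # prediction grid

K = rbf(x_obs, x_obs) + 1e-9 * np.eye(len(x_obs))
K_s = rbf(x_new, x_obs)

mean = K_s @ np.linalg.solve(K, y_obs)                     # posterior mean
cov = rbf(x_new, x_new) - K_s @ np.linalg.solve(K, K_s.T)  # posterior covariance
print(mean)  # interpolates y_obs at the observed inputs
```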
6.6. Properties
- A stochastic process \( (X_t)_{t\in T} \) on the probability space \( (\Omega, \Sigma, \mathrm{P}) \)
generates the natural filtration \( (\Sigma_t)_{t\in T} \) of \( \Sigma \):
\[
\Sigma_t := \sigma(X_k \mid k \le t).
\]
- The canonical probability space is the product space of the state spaces of all the \( X_t \), \( t\in T \).
7. References
- Marginal distribution - Wikipedia
- Conditional probability distribution - Wikipedia
- Bayesian probability - Wikipedia
- Probability space - Wikipedia
- Random variable - Wikipedia
- Probability distribution - Wikipedia
- Cauchy distribution - Wikipedia
- Exponential distribution - Wikipedia
- Beta distribution - Wikipedia
- Beta prime distribution - Wikipedia
- The Beta Distribution : Data Science Basics - YouTube
- Poisson point process - Wikipedia
- Quadratic variation - Wikipedia
- Martingale (probability theory) - Wikipedia
- Local martingale - Wikipedia
- Brownian motion - Wikipedia
- Wiener process - Wikipedia
- Gaussian process - Wikipedia
- Gaussian Processes - YouTube